Phase 1.3.D + 1.3.E: text + explicit strategies + CLI (v0.3.0) #4

Merged
hallelx2 merged 3 commits into main from feat/text-and-explicit-strategies on May 27, 2026

@hallelx2
Owner

Summary

Completes pdfplumber parity for the four canonical table-finding strategies. Ships:

  • text strategy — column boundaries inferred from clusters of words sharing X0/X1/centre; row boundaries from clusters sharing top-Y. Direct port of pdfplumber's words_to_edges_v / words_to_edges_h. Tunable via MinWordsVertical (default 3) / MinWordsHorizontal (default 1).
  • explicit strategy — caller-supplied edges via TableSettings.ExplicitVerticalLines / ExplicitHorizontalLines. At least two coordinates required per axis (matches pdfplumber); non-finite values dropped with a log warning.
  • Mixed strategies — every combination of the four strategies works across the two axes (16 combinations).
  • pdftable CLI (cmd/pdftable/main.go) with extract <file.pdf> [flags] mirroring pdfplumber's CLI surface. Stdlib flag only; no new go.mod dependencies.
  • README + CHANGELOG — side-by-side pdfplumber → pdftable snippets for all four strategies, mixed-strategy example, CLI section with full flag table, v0.3.0 changelog entry.
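
To illustrate the clustering idea behind the text strategy, here is a minimal sketch. It is not the code in finder_text.go: the Word type and the clusterX0 helper are hypothetical, and it clusters on X0 only, whereas the description above also mentions X1 and centre alignment. The tolerance value of 1 is likewise an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// Word is a hypothetical stand-in for the library's word type.
type Word struct {
	Text   string
	X0, X1 float64
}

// clusterX0 sorts the words' X0 values, groups neighbours that lie within
// tol of each other, and returns the mean X0 of every group that holds at
// least minWords words. Each returned value is a candidate column edge.
func clusterX0(words []Word, tol float64, minWords int) []float64 {
	if len(words) == 0 {
		return nil
	}
	xs := make([]float64, len(words))
	for i, w := range words {
		xs[i] = w.X0
	}
	sort.Float64s(xs)

	var edges []float64
	start := 0
	// flush closes the group xs[start:end] and keeps it if it is big enough.
	flush := func(end int) {
		if end-start < minWords {
			return
		}
		sum := 0.0
		for _, x := range xs[start:end] {
			sum += x
		}
		edges = append(edges, sum/float64(end-start))
	}
	for i := 1; i < len(xs); i++ {
		if xs[i]-xs[i-1] > tol {
			flush(i)
			start = i
		}
	}
	flush(len(xs))
	return edges
}

func main() {
	words := []Word{
		{Text: "Name", X0: 10.0}, {Text: "Ada", X0: 10.2}, {Text: "Bob", X0: 10.5},
		{Text: "Age", X0: 100.0}, {Text: "36", X0: 100.1}, {Text: "41", X0: 100.3},
		{Text: "stray", X0: 200.0}, // a lone word never forms a column
	}
	fmt.Printf("%.2f\n", clusterX0(words, 1, 3))
}
```

With a threshold of 3 (the MinWordsVertical default quoted above), the stray word at X0 = 200 is ignored because it has no aligned neighbours.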

What's in

  • finder_text.go: wordsToEdgesV, wordsToEdgesH, explicitVerticalEdges, explicitHorizontalEdges, validateExplicitForStrategy.
  • internal/layout/lines.go: new SourceText enum value.
  • finder.go: ensureSupportedStrategies reworked to only reject unknown strings (all four strategies now valid).
  • page.go: findTableEdges refactored to per-axis strategy dispatch via a new baseEdges helper. FindTables/ExtractTables invoke validateExplicitForStrategy.
  • table.go: updated docs on TableStrategy constants, MinWordsVertical/Horizontal, and ExplicitVerticalLines/HorizontalLines.
  • cmd/pdftable/main.go: CLI implementation.
  • testdata/fixtures.go: new TableBorderless() helper (3-column borderless table) used by unit + parity tests.
  • scripts/capture_pdfplumber_text_golden.py: captures pdfplumber's find_tables({text, text}) output for fixtures with a sibling .tables-text.target marker.
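
The explicit-strategy rules listed above (drop non-finite values, require at least two coordinates) can be sketched as follows. The function names finiteCoords and validateExplicit are hypothetical and are not the PR's explicitVerticalEdges or validateExplicitForStrategy. Whether the real check counts coordinates before or after filtering is not stated in this PR; the sketch assumes after.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"math"
)

// finiteCoords returns coords without NaN or ±Inf entries, logging each
// value it drops.
func finiteCoords(coords []float64) []float64 {
	out := make([]float64, 0, len(coords))
	for _, c := range coords {
		if math.IsNaN(c) || math.IsInf(c, 0) {
			log.Printf("dropping non-finite explicit coordinate %v", c)
			continue
		}
		out = append(out, c)
	}
	return out
}

// validateExplicit rejects an "explicit" axis that has fewer than two
// usable coordinates. Other strategies are not checked here.
// Assumption: the count is taken after non-finite values are removed.
func validateExplicit(strategy string, coords []float64) error {
	if strategy != "explicit" {
		return nil
	}
	if len(finiteCoords(coords)) < 2 {
		return errors.New("explicit strategy requires at least two coordinates")
	}
	return nil
}

func main() {
	fmt.Println(validateExplicit("explicit", []float64{72, 306, 540}))
	fmt.Println(validateExplicit("explicit", []float64{72, math.NaN()}))
	fmt.Println(validateExplicit("text", nil))
}
```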

What's out

  • Cell text byte-equality with pdfplumber on PDFs whose standard-14 fonts don't have bundled metrics — same documented v0.2.x limitation, deferred to v0.4.x's AFM bundle.
  • Cropped-page support (page.crop()) — used by some pdfplumber text-strategy tests (e.g. nics-background-checks-2015-11); out of scope here.

Test plan

  • go build ./... clean.
  • go vet ./... clean.
  • go test -count=1 ./... — 98 passing, 0 failing across all packages.
  • Unit tests for wordsToEdgesV / wordsToEdgesH on hand-crafted Word slices (alignment, threshold, empty).
  • Unit tests for explicitVerticalEdges / explicitHorizontalEdges (NaN/Inf filtering, source tagging).
  • Unit test for validateExplicitForStrategy (≥2 coords required when strategy is explicit).
  • End-to-end ExtractTables tests against TableBorderless() for text-only, explicit-only, and mixed strategies.
  • pdfplumber parity fixture table-3x4-borderless.pdf matches cell-for-cell against find_tables({text, text}): 1 table, 7 rows × 3 cols (header + 3 alternating data/empty rows).
  • CLI tests: JSON output schema, text-strategy propagation through flags, --pages filtering, missing-file error, mutually-exclusive --tables --text, parsePages parser correctness, reorderFlagsLast flag-order normalisation.
  • Acceptance gate: go build ./cmd/pdftable && ./pdftable extract testdata/golden/issue-466-example.pdf --tables --format json produces valid JSON with 2 detected tables.
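
For readers unfamiliar with the --pages syntax exercised by the CLI tests, here is one way a page-list parser could look. This is a sketch only: parsePageSpec is a hypothetical name, and the PR's parsePages may differ in signature and in how it treats whitespace, duplicates, or out-of-range pages.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePageSpec expands a spec such as "1,3-5" into 1-based page numbers.
// Entries are comma-separated; each entry is a single page or an inclusive
// "first-last" range.
func parsePageSpec(spec string) ([]int, error) {
	var pages []int
	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			return nil, fmt.Errorf("empty page entry in %q", spec)
		}
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(strings.TrimSpace(lo))
		if err != nil {
			return nil, fmt.Errorf("invalid page %q: %w", part, err)
		}
		last := first
		if isRange {
			last, err = strconv.Atoi(strings.TrimSpace(hi))
			if err != nil {
				return nil, fmt.Errorf("invalid page %q: %w", part, err)
			}
		}
		if first < 1 || last < first {
			return nil, fmt.Errorf("invalid page range %q", part)
		}
		for p := first; p <= last; p++ {
			pages = append(pages, p)
		}
	}
	return pages, nil
}

func main() {
	pages, err := parsePageSpec("1,3-5")
	fmt.Println(pages, err) // prints: [1 3 4 5] <nil>
}
```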

Parity fixtures matching

  • table-3x4-borderless (text strategy, both axes) — 1 table × 7 rows × 3 cols, cell-for-cell match.
  • All v0.2.0 line-strategy goldens (issue-466-example, hello, rules, simple1) continue to pass.

Roughly added

~2,150 lines (1,146 + 737 + ~270) across implementation, CLI, tests, scripts, and docs.

Do not merge

Awaiting review.

hallelx2 added 3 commits May 27, 2026 02:07
Implements the two remaining pdfplumber table-finding strategies:

- text: infer column boundaries by clustering words on X0 / X1 /
  centre; infer row boundaries by clustering on top-Y. Direct port of
  pdfplumber's words_to_edges_v / words_to_edges_h with the same
  MinWordsVertical (3) / MinWordsHorizontal (1) defaults.
- explicit: caller-supplied edges via TableSettings.Explicit*Lines.
  At least two coordinates required per axis (matches pdfplumber's
  validation); non-finite values dropped with a log warning.

Each axis selects its strategy independently, so mixed-strategy
settings (e.g. vertical=text + horizontal=lines) work out of the box.

- New layout.SourceText enum tagging text-derived edges.
- Page.findTableEdges refactored to dispatch per-axis on strategy
  instead of starting from a single primitive-edge slice.
- ensureSupportedStrategies now only rejects unknown strategy strings.
- New table_test.go cases: unit tests on hand-crafted Words slices;
  borderless / explicit / mixed extraction end-to-end on the new
  testdata.TableBorderless() fixture.
- pdfplumber parity test for the borderless fixture
  (TestGoldenTablesTextStrategyAgainstPdfplumber) — matches
  cell-for-cell against pdfplumber's find_tables({text, text}).
- scripts/capture_pdfplumber_text_golden.py captures the
  text-strategy expectation for any fixture with a sibling
  .tables-text.target marker.
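
The per-axis dispatch described in this commit can be pictured with the sketch below. It is illustrative only: Strategy, Edge, edgesFor, and demoSources are hypothetical names rather than the library's types, and the edge producers are stubs. The four strategy strings are pdfplumber's own.

```go
package main

import "fmt"

// Strategy names one of pdfplumber's four table-finding strategies.
type Strategy string

const (
	StrategyLines       Strategy = "lines"
	StrategyLinesStrict Strategy = "lines_strict"
	StrategyText        Strategy = "text"
	StrategyExplicit    Strategy = "explicit"
)

// Edge is a hypothetical edge record tagged with where it came from.
type Edge struct {
	Source string
	Pos    float64
}

// edgeSource produces the base edges for one axis.
type edgeSource func() []Edge

// edgesFor looks up the producer for a strategy and rejects unknown strings.
func edgesFor(s Strategy, sources map[Strategy]edgeSource) ([]Edge, error) {
	src, ok := sources[s]
	if !ok {
		return nil, fmt.Errorf("unknown table strategy %q", s)
	}
	return src(), nil
}

// demoSources returns stub producers, one per strategy.
func demoSources() map[Strategy]edgeSource {
	stub := func(name string) edgeSource {
		return func() []Edge { return []Edge{{Source: name, Pos: 72}} }
	}
	return map[Strategy]edgeSource{
		StrategyLines:       stub("lines"),
		StrategyLinesStrict: stub("lines_strict"),
		StrategyText:        stub("text"),
		StrategyExplicit:    stub("explicit"),
	}
}

func main() {
	sources := demoSources()
	all := []Strategy{StrategyLines, StrategyLinesStrict, StrategyText, StrategyExplicit}
	combos := 0
	// Each axis resolves its strategy independently, so every pairing works.
	for _, vertical := range all {
		for _, horizontal := range all {
			if _, err := edgesFor(vertical, sources); err != nil {
				panic(err)
			}
			if _, err := edgesFor(horizontal, sources); err != nil {
				panic(err)
			}
			combos++
		}
	}
	fmt.Println("valid combinations:", combos) // prints: valid combinations: 16
	_, err := edgesFor("bogus", sources)
	fmt.Println(err)
}
```

Because neither axis needs to know what the other chose, supporting a mixed setting such as vertical=text with horizontal=lines requires no special case.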

Adds cmd/pdftable, a stdlib-only command-line interface mirroring
pdfplumber's CLI surface for the operations the library implements:

- extract <file.pdf> [flags]: tables (--tables) or text (--text) on
  one page, a range (--pages 1,3-5), or all pages.
- Output format selectable via --format json|text. JSON shape includes
  page dimensions, table bbox, per-cell bbox, and rows.
- Full TableSettings surface exposed as flags:
  --vertical-strategy / --horizontal-strategy, --snap-tolerance,
  --join-tolerance, --edge-min-length, --intersection-tolerance,
  --text-tolerance, --min-words-vertical/horizontal,
  --explicit-vertical-lines/horizontal-lines, --indent.
- Positional argument can appear before OR after flags
  (pdfplumber-style invocation); reorderFlagsLast() shuffles tokens
  so the standard library flag package can parse either ordering.

Tested via cmd/pdftable/main_test.go: end-to-end runs against the
issue-466-example and table-3x4-borderless fixtures, plus unit tests
on parsePages, reorderFlagsLast, and the error paths.

No new go.mod dependencies — uses standard library flag, encoding/json,
strings, strconv only.
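
Some background on why token reordering is needed: Go's flag package stops parsing at the first non-flag argument, so `extract file.pdf --tables` would leave `--tables` unparsed. The sketch below shows one possible normalisation. The helper flagsFirst is hypothetical and is not the PR's reorderFlagsLast; it assumes the caller knows which flags are boolean, and it does not handle a `--` terminator or positional arguments that begin with a dash.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// flagsFirst moves flag tokens (and the values of non-boolean flags) ahead
// of positional tokens, preserving the relative order within each group.
// boolFlags lists flag names that take no separate value token.
func flagsFirst(args []string, boolFlags map[string]bool) []string {
	var flags, positional []string
	for i := 0; i < len(args); i++ {
		tok := args[i]
		if !strings.HasPrefix(tok, "-") || tok == "-" {
			positional = append(positional, tok)
			continue
		}
		flags = append(flags, tok)
		name := strings.TrimLeft(tok, "-")
		if strings.Contains(name, "=") || boolFlags[name] {
			continue // value is inline, or the flag has none
		}
		if i+1 < len(args) {
			i++ // the next token is this flag's value
			flags = append(flags, args[i])
		}
	}
	return append(flags, positional...)
}

func main() {
	fs := flag.NewFlagSet("extract", flag.ContinueOnError)
	tables := fs.Bool("tables", false, "extract tables")
	format := fs.String("format", "json", "output format")

	raw := []string{"input.pdf", "--tables", "--format", "json"}
	args := flagsFirst(raw, map[string]bool{"tables": true})
	if err := fs.Parse(args); err != nil {
		panic(err)
	}
	fmt.Println(*tables, *format, fs.Args()) // prints: true json [input.pdf]
}
```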

- CHANGELOG.md: v0.3.0 entry covering text + explicit strategies,
  mixed-strategy support, the pdftable CLI, the layout.SourceText
  enum, and the borderless parity fixture. Known limitations note
  the carried-over font-metric drift on cell text.
- README.md: status bumped to v0.3.0; "Tables" section reworked with
  side-by-side pdfplumber → pdftable snippets for all four
  strategies plus a mixed-strategy example; new "CLI" section
  documenting the extract subcommand and full flag table; roadmap
  reflects v0.4.x as the AFM-bundle phase.
Copilot AI review requested due to automatic review settings May 27, 2026 01:09

@sourcery-ai sourcery-ai Bot left a comment

Sorry @hallelx2, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery.

@coderabbitai

coderabbitai Bot commented May 27, 2026

Warning

Review limit reached

@hallelx2, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 51 minutes and 58 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: abc889b0-0159-4425-970a-e0fb7707406f

📥 Commits

Reviewing files that changed from the base of the PR and between 599c309 and 0ab1e7e.

⛔ Files ignored due to path filters (1)
  • testdata/golden/table-3x4-borderless.pdf is excluded by !**/*.pdf
📒 Files selected for processing (16)
  • CHANGELOG.md
  • README.md
  • cmd/pdftable/main.go
  • cmd/pdftable/main_test.go
  • finder.go
  • finder_text.go
  • golden_test.go
  • internal/layout/lines.go
  • page.go
  • scripts/capture_pdfplumber_text_golden.py
  • scripts/gen_table_fixture.go
  • table.go
  • table_test.go
  • testdata/fixtures.go
  • testdata/golden/table-3x4-borderless.tables-text.expected.json
  • testdata/golden/table-3x4-borderless.tables-text.target

@hallelx2 hallelx2 merged commit 87b453a into main May 27, 2026
5 of 6 checks passed
@hallelx2 hallelx2 deleted the feat/text-and-explicit-strategies branch May 27, 2026 01:10
@hallelx2 hallelx2 review requested due to automatic review settings May 27, 2026 01:29